Generating Chinese Named Entity Data from a Parallel Corpus

Authors

  • Ruiji Fu
  • Bing Qin
  • Ting Liu
Abstract

Annotating training corpora for Named Entity Recognition (NER) is costly but necessary for supervised NER systems. This paper presents an approach for generating large-scale Chinese NER training data from an English-Chinese parallel corpus aligned at the discourse level. The difficulty of NER varies across languages because of their distinct features; for example, English NER systems usually perform better on average than Chinese ones. In our method, we first apply a high-performance NER system to one side of a bilingual corpus. We then project the NE labels to the other side according to the word-level alignment. Finally, we select high-quality labeled sentences using different strategies and generate an NER training corpus. In our experiments, we generate a Chinese NER corpus of 167,100 sentences from an English-Chinese parallel corpus. The system trained on the automatically generated corpus achieves results comparable to those of the system trained on the manually annotated corpus. Further experiments show that NER performance improves significantly on two different evaluation sets when the generated training data is used in addition to the manually labeled data.
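The projection and selection steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the BIO tag scheme, the function names project_labels and keep_sentence, the toy sentence pair, and the "every entity token must be aligned" filter are all assumptions made for illustration, and the abstract does not specify which selection strategies the paper actually uses.

```python
from typing import List, Optional, Tuple


def project_labels(src_labels: List[str],
                   alignment: List[Tuple[int, int]],
                   tgt_len: int) -> List[str]:
    """Project BIO labels from source tokens onto target tokens.

    alignment holds (source_index, target_index) pairs from a word aligner.
    Simplification (assumption): adjacent target tokens that receive the same
    entity type are merged into one entity.
    """
    # Entity type received by each target token, or None if it gets no label.
    tgt_types: List[Optional[str]] = [None] * tgt_len
    for s, t in alignment:
        label = src_labels[s]
        if label != "O":
            tgt_types[t] = label.split("-", 1)[1]

    # Rebuild BIO tags on the target side from the projected types.
    tgt_labels: List[str] = []
    prev: Optional[str] = None
    for typ in tgt_types:
        if typ is None:
            tgt_labels.append("O")
        elif typ == prev:
            tgt_labels.append("I-" + typ)
        else:
            tgt_labels.append("B-" + typ)
        prev = typ
    return tgt_labels


def keep_sentence(src_labels: List[str],
                  alignment: List[Tuple[int, int]]) -> bool:
    """Hypothetical quality filter: keep a sentence pair only if every
    source token inside an entity is aligned to at least one target token."""
    aligned_sources = {s for s, _ in alignment}
    return all(i in aligned_sources
               for i, label in enumerate(src_labels) if label != "O")


# Toy example (invented data): English side already tagged by an NER system.
en_tokens = ["Ting", "Liu", "visited", "Beijing"]
en_labels = ["B-PER", "I-PER", "O", "B-LOC"]
zh_tokens = ["刘挺", "访问", "了", "北京"]
alignment = [(0, 0), (1, 0), (2, 1), (3, 3)]

zh_labels = project_labels(en_labels, alignment, len(zh_tokens))
selected = keep_sentence(en_labels, alignment)
```

In this toy pair both English name tokens align to the single Chinese token 刘挺, so the projected entity collapses to one B-PER tag, while the unaligned particle 了 stays outside any entity.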


Similar resources

Finding and Typing New Named Entities in Tibetan from Chinese-Tibetan Parallel Corpora

Currently there is much interest in the automatic acquisition of entities, with the goal of Named Entity Recognition (NER). However previous work has focused primarily on major languages, with the large, structured, and semantically rich knowledge bases and using the large corpus with annotated NER tags. In this paper, we describe a method for Chinese-Tibetan bilingual named entity recognition ...


Multi-feature Based Chinese-English Named Entity Extraction from Comparable Corpora

Bilingual Named Entity Extraction is important to some cross-language information processes such as machine translation (MT), cross-lingual information retrieval (CLIR), etc. Much previous work has extracted bilingual Named Entities from parallel corpora. Here we propose a multi-feature based method to extract bilingual Named Entities from comparable corpora. We first recognize the Chinese and Eng...


A'laam Corpus: A Standard Named Entity Corpus for the Persian Language

Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used for text summarization, data mining, data retrieval, question and answering, machine translation, and document classification systems. A NER system is tasked with determining the border of each named entity, recognizing its type and classifying it into predefined categories. The categories of named...


Toward a Name Entity Aligned Bilingual Corpus

This paper describes a co-training framework in which, through named entity aligned bilingual text, named entity taggers can complement and improve each other via an iterative process. This co-training approach allows us to 1) apply our method to not only parallel but also comparable text, greatly extending the applicability of the approach; and to 2) adapt named entity taggers to new domains; ...


Using Word Embeddings to Translate Named Entities

In this paper we investigate the usefulness of neural word embeddings in the process of translating Named Entities (NEs) from a resource-rich language to a language low on resources relevant to the task at hand, introducing a novel, yet simple way of obtaining bilingual word vectors. Inspired by observations in (Mikolov et al., 2013b), which show that training their word vector model on compara...




Publication date: 2011